Improved Typesetting Models for Historical OCR

Authors

Taylor Berg-Kirkpatrick

Dan Klein

Abstract

We present richer typesetting models that extend the unsupervised historical document recognition system of Berg-Kirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to accurately track italic and non-italic portions of documents. Richer models complicate inference, so we present a new, streamlined procedure that is over 25x faster than the method used by Berg-Kirkpatrick et al. (2013). Our final system achieves a relative word error reduction of 22% compared to state-of-the-art results on a dataset of historical newspapers.

Related resources

Font group identification using reconstructed fonts

Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed, highly readable, and faithful to the original. These goals can theoretically be achieved through OCR and font recognition, re-typesetting the document text with original fonts. However, OCR and font recognition remain hard problems, and many historical documents use fonts that are no...

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

This article describes the results of a case study that applies neural network-based Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus (Breuel et al. 2013) on the RIDGES herbal text corpus (Odebrecht et al. 2017, in press). Training specific OCR models was possible because the necessary ground truth is available as err...

OCR and post-correction of historical Finnish texts

This paper presents experiments on optical character recognition (OCR) that combine the Ocropy software with data-driven spelling correction based on weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.

Segmentation of Handwritten Characters for Digitalizing Korean Historical Documents

Historical documents are valuable cultural heritage and sources for the study of the history, society, and life of their time. The digitization of historical documents aims to provide instant access to the archives for researchers and the public, whose access had been limited for preservation reasons. However, most of these documents are not only written by hand in ancie...

Painfree LaTeX with Optical Character Recognition and Machine Learning

Recent years have seen an increasing interest in harnessing advancements in machine learning (ML) and optical character recognition (OCR) to convert physical and handwritten documents into digital versions. The increasing adoption of digital documents in academia, however, has added a new layer of complexity to the automatic digitization of physical documents. Compared to typical texts written in n...

Publication date: 2014
